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Abstract 

Distributed stateful stream processing enables the de¬ 
ployment and execution of large scale continuous com¬ 
putations in the cloud, targeting both low latency and 
high throughput. One of the most fundamental chal¬ 
lenges of this paradigm is providing processing guaran¬ 
tees under potential failures. Existing approaches rely 
on periodic global state snapshots that can be used 
for failure recovery. Those approaches suffer from two 
main drawbacks. First, they often stall the overall com¬ 
putation which impacts ingestion. Second, they ea¬ 
gerly persist all records in transit along with the oper¬ 
ation states which results in larger snapshots than re¬ 
quired. In this work we propose Asynchronous Bar¬ 
rier Snapshotting (ABS), a lightweight algorithm suited 
for modern dataflow execution engines that minimises 
space requirements. ABS persists only operator states 
on acyclic execution topologies while keeping a min¬ 
imal record log on cyclic dataflows. We implemented 
ABS on Apache Flink, a distributed analytics engine 
that supports stateful stream processing. Our evalua¬ 
tion shows that our algorithm does not have a heavy 
impact on the execution, maintaining linear scalability 
and performing well with frequent snapshots. 

Keywords fault tolerance, distributed computing, stream 
processing, dataflow, cloud computing, state manage¬ 
ment 

1. Introduction 

Distributed dataflow processing is an emerging paradigm 
for data intensive computing which allows continuous 
computations on data in high volumes, targeting low 
end-to-end latency while guaranteeing high through¬ 
put. Several time-critical applications could benefit 


from dataflow processing systems such as Apache 
Flink [1] and Naiad [11], especially in the domains of 
real-time analysis (e.g. predictive analytics and com¬ 
plex event processing). Fault tolerance is of paramount 
importance in such systems, as failures cannot be af¬ 
forded in most real-world use cases. Currently known 
approaches that guarantee exactly-once semantics on 
stateful processing systems rely on global, consistent 
snapshots of the execution state. However, there are 
two main drawbacks that make their application inef¬ 
ficient for real-time stream processing. Synchronous 
snapshotting techniques stop the overall execution of a 
distributed computation in order to obtain a consistent 
view of the overall state. Furthermore, to our knowl¬ 
edge all of the existing algorithms for distributed snap¬ 
shots include records that are in transit in channels or 
unprocessed messages throughout the execution graph 
as part of the snapshotted state. Most often this includes 
state that is larger than required. 

In this work, we focus on providing lightweight 
snapshotting, specifically targeted at distributed state¬ 
ful dataflow systems, with low impact on performance. 
Our solution provides asynchronous state snapshots 
with low space costs that contain only operator states 
in acyclic execution topologies. Additionally, we cover 
the case of cyclic execution graphs by applying down¬ 
stream backup on selected parts of the topology while 
keeping the snapshot state to minimum. Our technique 
does not halt the streaming operation and it only intro¬ 
duces a small runtime overhead. The contributions of 
this paper can be summarised as follows: 

• We propose and implement an asynchronous snap¬ 
shotting algorithm that achieves minimal snapshots 
on acyclic execution graphs. 


• We describe and implement a generalisation of our 
algorithm that works on cyclic execution graphs. 

• We show the benefits of our approach compared to 
the state-of-the-art using Apache Flink Streaming as 
a base system for comparisons. 

The rest of the paper is organised as follows: Section 2 
gives an overview of existing approaches for distributed 
global snapshots in stateful dataflow systems. Section 
3 provides an overview of the Apache Flink processing 
and execution model followed by Section 4 where we 
describe our main approach to global snapshotting in 
detail. Our recovery scheme is described briefly in Sec¬ 
tion 5. Finally, Section 6 summarises our implementa¬ 
tion followed by our evaluation in Section 7 and future 
work and conclusion in Section 8. 

2. Related Work 

Several recovery mechanisms have been proposed dur¬ 
ing the last decade for systems that do continuous pro¬ 
cessing [4, 11]. Systems that emulate continuous pro¬ 
cessing into stateless distributed batch computations 
such as Discretized Streams and Comet [6, 15] rely 
on state recomputation. On the other hand, stateful 
dataflow systems such as Naiad, SDGs, Piccolo and 
SEEP [3, 5, 11, 12] , which are our main focus in this 
work, use checkpointing to obtain consistent snapshots 
of the global execution for failure recovery. 

The problem of consistent global snapshots in dis¬ 
tributed environments, as introduced by Chandy and 
Lamport [4], has been researched extensively through¬ 
out the last decades [4, 7, 8]. A global snapshot the¬ 
oretically reflects the overall state of an execution, or 
a possible state at a specific instance of its operation. 
A simple but costly approach employed by Naiad [11] 
is to perform a synchronous snapshot in three steps: 
first halting the overall computation of the execution 
graph, then performing the snapshot and finally in¬ 
structing each task to continue its operation once the 
global snapshot is complete. This approach has a high 
impact on both throughput and space requirements due 
to the need to block the whole computation, while also 
relying on upstream backup that logs emitted records 
at the producer side. Another popular approach, origi¬ 
nally proposed by Chandy and Lamport [4], that is de¬ 
ployed in many systems today is to perform snapshots 
asynchronously while eagerly doing upstream backup 
[4, 5, 10]. This is achieved by distributing markers 


throughout the execution graph that trigger the per¬ 
sistence of operator and channel state. This approach 
though still suffers from additional space requirements 
due to the need of an upstream backup and as a re¬ 
sult higher recovery times caused by the reprocess¬ 
ing of backup records. Our approach extends the orig¬ 
inal asynchronous snapshotting idea of Chandy and 
Lamport, however, it considers no backup logging of 
records for acyclic graphs while also keeping very se¬ 
lective backup records on cyclic execution graphs. 

3. Background: The Apache Flink System 

Our current work is guided by the need for fault toler¬ 
ance on Apache Flink Streaming, a distributed stream 
analytics system that is part of the Apache Flink Stack 
(former Stratosphere [2]). Apache Flink is architec- 
tured around a generic runtime engine uniformly pro¬ 
cessing both batch and streaming jobs composed of 
stateful interconnected tasks. Analytics jobs in Flink 
are compiled into directed graphs of tasks. Data el¬ 
ements are fetched from external sources and routed 
through the task graph in a pipelined fashion. Tasks are 
continuously manipulating their internal state based on 
the received inputs and are generating new outputs. 

3.1 The Streaming Programming Model 

The Apache Flink API for stream processing allows 
the composition of complex streaming analytics jobs 
by exposing unbounded partitioned data streams (par¬ 
tially ordered sequences of records) as its core data 
abstraction, called DataStreams. DataStreams can be 
created from external sources (e.g. message queues, 
socket streams, custom generators) or by invoking op¬ 
erations on other DataStreams. DataStreams support 
several operators such as map, filter and reduce in the 
form of higher order functions that are applied incre¬ 
mentally per record and generate new DataStreams. 
Every operator can be parallelised by placing parallel 
instances to run on different partitions of the respec¬ 
tive stream, thus, allowing the distributed execution of 
stream transformations. 

The code example in 1 shows how to implement a 
simple incremental word count in Apache Flink. In this 
program words are read from a text file and the current 
count for each word is printed to the standard output. 
This is a stateful streaming program as sources need to 
be aware of their current file offset and counters need 



Figure 1: The execution graph for incremental word 
count 

to maintain the current count for each word as their 
internal state. 

val env : StreamExecutionEnvironment = ... 
env.setParallelism(2) 

val wordStream = env.readTextFile(path) 
val countStream = wordStream.groupBy(_).count 
countStream.print 

Example 1: Incremental Word Count 

3.2 Distributed Dataflow Execution 

When a user executes an application all DataStream 
operators compile into an execution graph that is in 
principle a directed graph G = (T^E), similarly to Na¬ 
iad [11] where vertices T represent tasks and edges E 
represent data channels between tasks. An execution 
graph is depicted in Fig. 1 for the incremental word 
count example. As shown, every instance of an opera¬ 
tor is encapsulated on a respective task. Tasks can be 
further classified as sources when they have no input 
channels and sinks when no output channels are set. 
Furthermore, M denotes the set of all records trans¬ 
ferred by tasks during their parallel execution. Each 
task t encapsulates the independent execution of an 
operator instance and is composed of the following: (1) 
a set of input and output channels: 4, C E; (2) an op¬ 
erator state St and (3) a user defined function (UDF) ft. 
Data ingestion is pull-based : during its execution each 
task consumes input records, updates its operator state 
and generates new records according to its user defined 
function. More specifically, for each record r G M re¬ 
ceived by a task / G T a new state s[ is produced along 
with a set of output records D C M according to its 
UDF/,:^„r^(^;,D). 

4. Asynchronous Barrier Snapshotting 

In order to provide consistent results, distributed pro¬ 
cessing systems need to be resilient to task failures. 


A way of providing this resilience is to periodically 
capture snapshots of the execution graph which can 
be used later to recover from failures. A snapshot is 
a global state of the execution graph, capturing all nec¬ 
essary information to restart the computation from that 
specific execution state. 

4.1 Problem Deflnition 

We define a global snapshot G* = of an exe¬ 

cution graph G = (r,£’) as a set of all task and edge 
states, r* and E* respectively. In more detail, T* con¬ 
sists of all operator states G r*,V/ G T, while E* is 
a set of all channel states G E* where consists of 
records that are in transit on e. 

We require that certain properties hold for each snap¬ 
shot G* in order to guarantee correct results after recov¬ 
ery such as termination and feasibility as described by 
Tel [14]. 

Termination guarantees that a snapshot algorithm even¬ 
tually finishes in finite time after its initiation if all pro¬ 
cesses are alive. Eeasibility expresses the meaningful¬ 
ness of a snapshot, i.e. that during the snapshotting pro¬ 
cess no information has been lost regarding the com¬ 
putation. Formally, this implies that causal order [9] is 
maintained in the snapshot such that records delivered 
in tasks are also sent from the viewpoint of a snapshot. 

4.2 ABS for Acyclic Dataflows 

It is feasible to do snapshots without persisting channel 
states when the execution is divided into stages. Stages 
divide the injected data streams and all associated com¬ 
putations into a series of possible executions where all 
prior inputs and generated outputs have been fully pro¬ 
cessed. The set of operator states at the end of a stage 
reflectsg the whole execution history, therefore, it can 
be solely used for a snapshot. The core idea behind our 
algorithm is to create identical snapshots with staged 
snapshotting while keeping a continuous data inges¬ 
tion. 

In our approach, stages are emulated in a continu¬ 
ous dataflow execution by special barrier markers in¬ 
jected in the input data streams periodically that are 
pushed throughout the whole execution graph down 
to the sinks. Global snapshots are incrementally con¬ 
structed as each task receives the barriers indicating ex¬ 
ecution stages. We further make the following assump¬ 
tions for our algorithm: 
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Figure 2: Asynchronous barrier snapshots for acyclic graphs 


Algorithm 1 Asynchronous Barrier Snapshotting for 
Acyclic Execution Graphs 

1 : upon event {Init \ input ^channels, out¬ 
put-channels, fun, init-State) do 
2 : state init state', blocked-inputs 0 ; 

3: inputs := input -channels', 

4: outputs := output-Channels', udf := fun', 

5: 

6: upon event {receive \ input, {barrier)) do 
7: if input Nil then 

8: blocked-inputs := blocked-inputs U 

{input}', 

9: trigger (block | input); 

10: if blocked-inputs — inputs then 

11: blocked-inputs := 0; 

12: broadcast (send | outputs, {barrier))', 

13: trigger (snapshot | state); 

14: for each inputs as input 

15: trigger (unblock | input ); 

16: 

17: 

18: upon event {receive \ input, msg) do 

19: {state' ^out-records} udf {msg ^ state)', 

20: state := state'', 

21: for each out -records as {output ^ out-record} 

22: trigger (send | output, out_record); 

23: 

24: 


• Network channels are quasi-reliable, respect a FIFO 
delivery order and can be blocked and unblocked. 
When a channel is blocked all messages are buffered 
but not delivered until it gets unblocked. 

• Tasks can trigger operations on their channel com¬ 
ponents such as block, unblock and send messages. 


Broadcasting messages is also supported on all out¬ 
put channels. 

• Messages injected in source tasks (i.e. stage barri¬ 
ers) are resolved into a “Nil” input channel. 

The ABS Algorithm 1 proceeds as follows (depicted 
in Fig. 2): A central coordinator periodically injects 
stage barriers to all the sources. When a source receives 
a barrier it takes a snapshot of its current state, then 
broadcasts the barrier to all its outputs (Fig.2(a)). When 
a non-source task receives a barrier from one of its in¬ 
puts, it blocks that input until it receives a barrier from 
all inputs (line 9, Fig.2(b)). When barriers have been re¬ 
ceived from all the inputs, the task takes a snapshot of 
its current state and broadcasts the barrier to its outputs 
(lines 12-13, Fig.2(c)). Then, the task unblocks its input 
channels to continue its computation (line 15, Fig.2(d)). 
The complete global snapshot G* * = (r*,£'*) will con¬ 
sist solely of all operator states T* where = 0. 
Proof Sketch ^ : As mentioned earlier a snapshot algo¬ 
rithm should guarantee termination mid feasibility. 
Termination is guaranteed by the channel and acyclic 
execution graph properties. Reliability of the channels 
ensures that every barrier sent will eventually be re¬ 
ceived as long as the tasks are alive. Furthermore, as 
there is always a path from a source, every task in a 
DAG topology will eventually receive barriers from all 
its input channels and take a snapshot. 

For feasibility it suffices to show that the operator states 
in a global snapshot reflect only the history of records 
processed up to the last stage. This is guaranteed by 
the FIFO ordering property of the channels and the 
blocking of input channels upon barriers ensuring that 


^ We omit the formal proof of the algorithms in this version of the 
paper due to space limitations 
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Figure 3: Asynchronous barrier snapshots for cyclic graphs 


no post-shot records of a stage (records succeeding a 
barrier) are processed before a snapshot is taken. 

4.3 ABS for Cyclic Dataflows 

In the presence of directed cycles in the execution 
graph the ABS algorithm presented before would not 
terminate resulting into a deadlock, as tasks in a cycle 
would wait indefinitely to receive barriers from all their 
inputs. Additionally, records that are arbitrarily in tran¬ 
sit within cycles would not be included in the snapshot, 
thus violating feasibility. It is therefore needed to con¬ 
sistently include all records generated within a cycle 
in the snapshot for feasibility and to put these records 
back in transit upon recovery. Our approach to deal 
with cyclic graphs extends the basic algorithm with¬ 
out inducing any additional channel blocking as seen in 
Algorithm 2. First, we identify back-edges L on loops 
in the execution graph by static analysis. From control 
flow graph theory a back-edge in a directed graph is 
an edge that points to a vertex that has already been 
visited during a depth-first search. The execution graph 
G{T^E \L) is a DAG containing all tasks in the topol¬ 
ogy. From the perspective of this DAG the algorithm 
operates as before, nevertheless, we additionally apply 
downstream backup of records received from identi¬ 
fied back-edges over the duration of a snapshot. This 
is achieved by each task t that is a consumer of back- 
edges Lt G If creating a backup log of all records 
received from Lf from the moment it forwards barri¬ 
ers until receiving them back from Lf. Barriers push 
all records in transit within loops into the downstream 
logs, so they are included once in the consistent snap¬ 
shot. 

The ABS Algorithm 2 proceeds as follows (depicted 
in Fig. 3) in more detail: Tasks with back-edge inputs 
create a local copy of their state once all their regular 
{e ^ L) channels have delivered barriers (line 14, Fig. 3 
(b)). Moreover, from this point they log all records 


delivered from their back-edges until they receive stage 
barriers from them (line 26). This allows, as seen in Fig. 
3(c), for all pre-shot records that are in transit within 
loops to be included in the current snapshot. Mind that 
the final global snapshot G* = (r*,L*) contains all 
task states T* and only the back-edge records in transit 
L* C^*. 

Proof Sketch: Again, we need to prove that termina¬ 
tion diwA feasibility are guaranteed in this version of the 
algorithm. 

As in 4.2 termination is guaranteed, because every task 
will eventually receive barriers from all its inputs (in¬ 
cluding the back-end channels) and complete its snap¬ 
shot. By broadcasting the barrier as soon as receiving 
it from all regular inputs, we avoid the deadlock condi¬ 
tion mentioned previously. 

The FIFO ordering property still holds for back-edges 
and the following properties prowo feasibility . (1) Each 
task state included in the snapshot is a state copy of 
the respective task taken before processing any post¬ 
shot events from barriers received on regular inputs. 
(2) The downstream log included in the snapshot is 
complete and contains all pending post-shot records 
prior to barriers received on back-edges due to FIFO 
guarantees. 


5. Failure Recovery 

While not being the main focus of this work, a work¬ 
ing failure recovery scheme motivates the application 
of our snapshotting approach. Thus, we provide a brief 
explanation here regarding its operation. There are sev¬ 
eral failure recovery schemes that work with consis¬ 
tent snapshots. In its simplest form the whole execution 
graph can be restarted from the last global snapshot as 
such: every task / (1) retrieves from persistent storage 
its associated state for the snapshot Sf and sets it as its 
initial state, (2) recovers its backup log and processes 




















Algorithm 2 Asynchronous Barrier Snapshotting for 
Cyclic Execution Graphs 

1: upon event {Init \ input ^channels, 

backedge.channels, output ^channels, fun, 

init-State) do 

2: state init state', marked := 0; 

3: inputs := input ^channel s', logging := False', 

4: outputs output channels', udf fun', 

5: loopJnputs := backedgechannels', 

6: statesopy Nil', backup Jog []; 

7: 

8: upon event {receive \ input, {barrier)) do 
9: marked : = marked U {in put }; 

10: regular := inputs \ loopJnputs', 

11: if input Nil AND input ^ loopJnputs then 

12: trigger (block | input); 

13: if -^logging AND marked = regular then 

14: St ate-Copy state', logging True', 

15: broadcast (send | outputs, {barrier))', 

16: for each inputs as input 

17: trigger (unblock | input); 

18: 

19: if marked — input .channels then 

20: trigger (snapshot | {state_copy, 

backup Jog}); 

21: marked (d',logging := False', 

22: state-Copy := Nil', backup Jog := []; 

23: 

24: upon event {receive \ input, msg) do 

25: if logging AND node G loopJnputs then 

26: backup Jog := backup Jog :: [input]', 

27: {state' ^out .records} := udf {msg ^ state)', 

28: state state'', 

29: for each out.records as {output^ out.record} 

30: trigger (send | output, out_record); 

31: 

32: 


all contained records, (3) starts ingesting records from 
its input channels. 

A partial graph recovery scheme is also possible, 
similarly to TimeStream [13], by rescheduling only up¬ 
stream task dependencies (tasks that hold output chan¬ 
nels to the failed tasks) and their respective upstream 
tasks up to the sources. An example recovery plan is 
shown in Fig. 4. In order to offer exactly-once seman¬ 
tics, duplicate records should be ignored in all down- 



Figure 4: Upstream Dependency Recovery Scheme 


stream nodes to avoid recomputation. To achieve this 
we can follow a similar scheme to SDGs [5] and mark 
records with sequence numbers from the sources, thus, 
every downstream node can discard records with se¬ 
quence numbers less than what they have processed al¬ 
ready. 

6. Implementation 

We contributed the implementation of the ABS algo¬ 
rithm to Apache Flink in order to provide exactly-once 
processing semantics for the streaming runtime. In our 
current implementation blocked channels store all in¬ 
coming records on disk instead of keeping them in 
memory to increase scalability. While this technique 
ensures robustness, it increases the runtime impact of 
the ABS algorithm. 

In order to distinguish operator state from data we 
introduced an explicit OperatorState interface which 
contains methods for updating and checkpointing the 
state. We provided OperatorState implementations for 
the stateful runtime operators supported by Apache 
Flink such as offset based sources or aggregations. 

Snapshot coordination is implemented as an actor 
process on the job manager that keeps a global state 
for an execution graph of a single job. The coordinator 
periodically injects stage barriers to all sources of the 
execution graph. Upon reconfiguration, the last glob¬ 
ally snapshotted state is restored in the operators from 
a distributed in-memory persistent storage. 

7. Evaluation 

The goal of our evaluation is to compare the runtime 
overhead of ABS with the globally synchronised snap¬ 
shot algorithm employed in Naiad [11] and also to test 
the scalability of the algorithm to larger number of 
nodes. 

7.1 Experiment Setup 

The execution topology (Fig. 5) used for evaluation 
consists of 6 distinct operators with parallelism equal 
to the number of cluster nodes, which translates into a 









Figure 5: Execution topology 
used for evaluation 


Figure 6: Runtime impact comparison Figure 7: Runtime with ABS on dif- 
on varying snapshot intervals ferent cluster sizes 


6'^cluster size task vertices. The execution contains 3 
full network shuffles in order to accentuate the possi¬ 
ble impact of channel blocking in ABS. Sources gen¬ 
erate a total of 1 billion records which are distributed 
uniformly among the source instances. The state of the 
operators in the topology were per-key aggregates and 
source offsets. The experiments were run on Amazon 
EC2 cluster using up to 40 m3.medium instances. 

We measured the runtime overhead of our evalua¬ 
tion job running under different snapshotting schemes, 
namely ABS and synchronised snapshotting [11] with 
varying snapshot intervals. We implemented the syn¬ 
chronous snapshotting algorithm used in Naiad [11] on 
Apache Flink in order to have identical execution back¬ 
end for the comparison. This experiment was run using 
a 10 node cluster. To evaluate the scalability of our al¬ 
gorithm, we processed a fixed amount of input records 
(1 billion) while increasing the parallelism of our topol¬ 
ogy from 5 up to 40 nodes. 

7.2 Results 

In Fig. 6 we depict the runtime impact of the two al¬ 
gorithms against the baseline (no fault tolerance). The 
large performance impact of synchronous snapshotting 
is especially visible when the interval of snapshotting 
is small. That is due to the fact that the system spends 
more time not processing any data, in order to obtain 
global snapshots. ABS has a much lower impact on 
the runtime as it runs continuously without blocking 
the overall execution, while maintaining a rather sta¬ 
ble throughput rate. For larger snapshot intervals the 
impact of the synchronised algorithm is less signifi¬ 
cant since it operates in bursts (of 1-2 sec in our ex¬ 
periments) while letting the system run at its normal 
throughput during the rest of its execution. Neverthe¬ 
less, bursts can often violate SLAs in terms of real¬ 


time guarantees for many applications that are latency- 
critical, such as intrusion detection pipelines. Thus, 
such applications would further benefit by the perfor¬ 
mance of ABS. In Fig. 7 we compare the scalability 
of the topology running ABS with 3 second snapshot 
intervals against the baseline (without fault tolerance). 
It is clear that both the baseline job and ABS achieved 
linear scalability. 

8. Future Work and Conclusion 

In future work we are planning to explore the possi¬ 
bilities of further lowering the impact of ABS by de¬ 
coupling snapshotting state and operational state. That 
allows for purely asynchronous state management as 
tasks can continuously process records while persist¬ 
ing snapshots. In such scheme there is also the need 
to synchronise pre-shot and post-shot records with the 
respective state which can be resolved by marking the 
records depending on the snapshot they belong to. As 
this approach would increase the computational, space 
and network I/O requirements of the algorithm, we plan 
to compare its performance to our current ABS imple¬ 
mentation. Finally, we plan to investigate different re¬ 
covery techniques that maintain exactly-once seman¬ 
tics while minimising the need for reconfiguration by 
operating on a per-task granularity. 

In summary, we focused on the problem of perform¬ 
ing periodic global snapshots on distributed dataflow 
systems. We introduced ABS, a new snapshotting tech¬ 
nique that achieves good throughput. ABS is the first 
algorithm that considers the minimum state possible 
for acyclic execution topologies. Furthermore, we ex¬ 
tend ABS to work with cyclic execution graphs by 
storing only the records that need to be reprocessed 
upon recovery. We implemented ABS on Apache Flink 
and evaluated our approach against synchronous snap- 




































shotting. At this early stage ABS shows good results, 
having low impact on the overall execution throughput 
with linear scalability. 
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